We have constantly been told that “computers don’t lie”. Computers indeed don’t lie, but neither do they speak the truth. A computer does what its master programs it to do. Similarly, a model won’t lie unless the Machine Learning Engineer, knowingly or not, builds it to.
A nice episode of the podcast You Are Not So Smart came out last year. Here is an excerpt from it:
“I want a machine-learning algorithm to learn what tumors looked like in the past, and I want it to become biased toward selecting those kind of tumors in the future,” explains philosopher Shannon Vallor at Santa Clara University. “But I don’t want a machine-learning algorithm to learn what successful engineers and doctors looked like in the past and then become biased toward selecting those kinds of people when sorting and ranking resumes.”
Machine Bias can occur due to many factors, biased training data being one of the primary ones.
Below is an example of how Google Translate, when translating the following text to a gender-neutral language and back to English, applies its bias (primarily due to the nature of its biased training dataset).
[Image: Google Translate round-trip translation through a gender-neutral language, showing gendered bias]
The first step in solving any problem is accepting that the problem exists. Let’s accept that fact and see how the Kaggle Survey results can help the community tackle Machine Bias.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(highcharter))
suppressPackageStartupMessages(library(DataExplorer))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(viridis))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(RColorBrewer))
pal <- brewer.pal(9,"BuGn")
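Before any of the code below runs, the survey responses need to be loaded; the notebook doesn’t show this step. A minimal sketch, assuming the Kaggle 2018 multipleChoiceResponses.csv layout in which the first row holds question codes and the second row holds the full question text used as column names here:

```r
# Assumption: skip the question-code row so the full question text becomes the header
survey <- readr::read_csv("../input/multipleChoiceResponses.csv", skip = 1)
```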
plotting_missing <- function(df){
#based on erikbruin's code snippet
NAcol <- which(colSums(is.na(df)) > 0)
NAcount <- sort(colSums(sapply(df[NAcol], is.na)), decreasing = TRUE)
NADF <- data.frame(variable=names(NAcount), missing=NAcount)
NADF$PctMissing <- round(((NADF$missing/nrow(df))*100),1)
NADF %>%
ggplot(aes(x=reorder(variable, PctMissing), y=PctMissing)) +
geom_bar(stat='identity', fill='red') + coord_flip(y=c(0,110)) +
labs(x="", y="Percent missing") +
geom_text(aes(label=paste0(NADF$PctMissing, "%"), hjust=-0.1))
}
The above plot demonstrates how much these questions about Model Fairness / Bias have been ignored.
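The plotting_missing() helper above might be invoked like this (a sketch; the column selection is an assumption):

```r
# Share of missing answers among the perception questions
survey %>%
  select(contains("How do you perceive the importance of the following topics?")) %>%
  plotting_missing()
```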
While the salary question made 15% of respondents skip it, the questions about Reproducibility, Explainability and Bias made 37% of respondents skip answering. The salary comparison is here to show how much worse questions like these fare.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"-","\n")) %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-|\\:"),"")) %>%
group_by(key) %>%
count(value) %>%
drop_na() %>%
mutate(n = n / sum(n)) %>%
ggplot() + geom_col(aes(value,n, fill = key), show.legend = FALSE) +
geom_label(aes(x = value, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
facet_wrap(~key) +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Perception on Reproducibility, Explainability and Model Bias ",
subtitle = "Percentage",
x = "Selected Options",
y = "Percentage of Respondents (other than NAs)")
To get a better perspective of the volume of respondents, below is the same plot as above, but with the absolute numbers of respondents and their options.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"-","\n")) %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-|\\:"),"")) %>%
group_by(key) %>%
count(value) %>%
drop_na() %>%
ggplot() + geom_col(aes(value,n, fill = key), show.legend = FALSE) +
geom_label(aes(x = value, y = n + 20, label = n),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
facet_wrap(~key) +
#scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Perception on Reproducibility, Explainability and Model Bias ",
subtitle = "Absolute Numbers",
x = "Selected Options",
y = "Number of Respondents (other than NAs)")
Fairness and Bias:
Only about half (57.4%) of the respondents who chose to answer consider Fairness and Bias in ML algorithms Very important.
This is the lowest Very important sentiment echoed by the community across all three questions.
3.6% of those who chose to respond perceive it as Not at all important, the highest Not at all important sentiment expressed across all three questions.
survey %>% select(contains("How do you perceive the importance of the following topics?")) %>%
gather() %>%
mutate(key = str_replace(key,"How do you perceive the importance of the following topics?",""),
key = str_replace(key, regex("\\?"),""),
key = str_replace(key, regex("\\-"),"")) %>%
group_by(key) %>%
count(value) %>%
mutate(n = percent(n / sum(n))) %>%
spread(value,n) %>%
knitr::kable()
| key | No opinion; I do not know | Not at all important | Slightly important | Very important | NA |
|---|---|---|---|---|---|
| Being able to explain ML model outputs and/or predictions | 2.9% | 1.6% | 17.0% | 41.1% | 37.4% |
| Fairness and bias in ML algorithms: | 5.4% | 2.3% | 19.0% | 36.0% | 37.4% |
| Reproducibility in data science | 3.8% | 1.0% | 14.9% | 42.9% | 37.4% |
They refers to those respondents who think Fairness and Bias in ML algorithms are Very important in Machine Learning.
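The they / not_they split used below is not defined at this point in the notebook; here is a sketch of how it could be built, assuming the same filtering pattern the notebook later uses for the explainability question (the exact column name is an assumption):

```r
# Sketch (assumption): partition respondents on the Fairness and bias question
they <- survey %>%
  filter(`How do you perceive the importance of the following topics? - Fairness and bias in ML algorithms:` == "Very important")

not_they <- survey %>%
  filter(`How do you perceive the importance of the following topics? - Fairness and bias in ML algorithms:` != "Very important")
```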
they %>% group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Gender" = `What is your gender? - Selected Choice`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Gender,n, fill = Gender), show.legend = FALSE) +
geom_label(aes(x = Gender, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Gender",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>% group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Gender" = `What is your gender? - Selected Choice`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Gender,n, fill = Gender), show.legend = FALSE) +
geom_label(aes(x = Gender, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Gender",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)
Let us create a new KPI called the They - Not They Ratio to give a different perspective on this comparison.
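The They vs Not They ratio computed below is repeated for many grouping variables throughout this notebook; a hedged refactoring sketch (the t_nt_ratio function name and tidy-eval style are mine, not from the original):

```r
# Count They vs Not They per level of a grouping column and take their ratio
t_nt_ratio <- function(they, not_they, col) {
  they %>%
    count(Level = {{ col }}, name = "They") %>%
    inner_join(not_they %>% count(Level = {{ col }}, name = "Not They"),
               by = "Level") %>%
    mutate(T_NT_Ratio = round(They / `Not They`, 3)) %>%
    arrange(desc(T_NT_Ratio))
}
```

For example, t_nt_ratio(they, not_they, `What is your gender? - Selected Choice`) would compute the same ratio as the pipeline that follows.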
they %>% filter(!`What is your gender? - Selected Choice` %in% c("Other")) %>%
group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree" = `What is your gender? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% filter(!`What is your gender? - Selected Choice` %in% c("Other")) %>%
group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree1" = `What is your gender? - Selected Choice`,
"Not They" = n) %>%
select(-Degree1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("column",hcaes("Degree","T_NT_Ratio")) %>%
hc_title(text = "Gender-wise preference - Model Bias and Fairness ") %>%
hc_add_theme(hc_theme_538())
There is a 5.1 percentage-point difference in the Female share between those who perceive Model Fairness & Bias in ML as Very important and the others.
While this could be read as the Female gender usually being the one affected by these biases, it’s also important to realize that Male Kagglers don’t echo the same sentiment as their female counterparts. After all, a healthy model is what we all want, don’t we?
Using the T_NT_Ratio index we created, we can see that Female Kagglers, and those who selected options other than Male, rank above Male Kagglers in their perception of Model Bias and Fairness.
they %>% group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age" = `What is your age (# years)?`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>% group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age" = `What is your age (# years)?`) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)
Age doesn’t seem to reveal anything straight away, which could be due to the large number of age brackets. Let us do a bit of feature engineering to club them into two groups: less than 30 and 30+.
they %>%
mutate(age_grp = ifelse(parse_number(`What is your age (# years)?`) < 30,
"Less than 30",
"30+")) %>%
group_by(age_grp) %>% count() %>% ungroup() %>%
rename("Age" = age_grp) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(age_grp = ifelse(parse_number(`What is your age (# years)?`) < 30,
"Less than 30",
"30+")) %>%
group_by(age_grp) %>% count() %>% ungroup() %>%
rename("Age" = age_grp) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Age,n, fill = Age), show.legend = FALSE) +
geom_label(aes(x = Age, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "E") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Age",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)
This plot suggests that younger Kagglers need to be brought up to speed on the implications of Model Bias and Fairness more than their older counterparts. That leads us to another important section: what they do.
they %>%
mutate(title = ifelse(`Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Student",
"Student",
"Professional")) %>%
group_by(title) %>% count() %>% ungroup() %>%
rename("Title" = title) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Title,n, fill = Title), show.legend = FALSE) +
geom_label(aes(x = Title, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "C") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(title = ifelse(`Select the title most similar to your current role (or most recent title if retired): - Selected Choice` == "Student",
"Student",
"Professional")) %>%
group_by(title) %>% count() %>% ungroup() %>%
rename("Title" = title) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(Title,n, fill = Title), show.legend = FALSE) +
geom_label(aes(x = Title, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "C") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)
they %>%
mutate(UG = ifelse(`Which best describes your undergraduate major? - Selected Choice` %in% c("Computer science (software engineering, etc.)","Information technology, networking, or system administration"),
"CS",
"Non_CS")) %>%
group_by(UG) %>% count() %>% ungroup() %>%
rename("UG" = UG) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(UG,n, fill = UG), show.legend = FALSE) +
geom_label(aes(x = UG, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p1
not_they %>%
mutate(UG = ifelse(`Which best describes your undergraduate major? - Selected Choice` %in% c("Computer science (software engineering, etc.)","Information technology, networking, or system administration"),
"CS",
"Non_CS")) %>%
group_by(UG) %>% count() %>% ungroup() %>%
rename("UG" = UG) %>%
mutate(n = n / sum(n),
perc = percent(n)) %>%
ggplot() + geom_col(aes(UG,n, fill = UG), show.legend = FALSE) +
geom_label(aes(x = UG, y = n - 0.05, label = percent(n)),
# hjust=0, vjust=0, size = 4, colour = 'black',
fontface = 'bold') +
scale_fill_viridis(discrete = T, option = "D") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
theme(axis.text = element_text(angle = 45, size = 6)) +
labs(title = "Not They",
subtitle = "Perception on Fairness and Model Bias ",
x = "Title",
y = "Percentage of Respondents (other than NAs)") -> p2
cowplot::plot_grid(p1,p2)
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang" = `What specific programming language do you use most often? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang1" = `What specific programming language do you use most often? - Selected Choice`,
"Not They" = n) %>%
select(-Lang1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("bar",hcaes("Lang","T_NT_Ratio")) %>%
hc_title(text = "Language ordered by They-NotThey Ratio") %>%
hc_add_theme(hc_theme_538())
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country" = `In which country do you currently reside?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country1" = `In which country do you currently reside?`,
"Not They" = n) %>%
select(-Country1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Country","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Country-wise") %>%
hc_add_theme(hc_theme_538())
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country" = `In which country do you currently reside?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In which country do you currently reside?`) %>% count() %>% ungroup() %>%
rename("Country1" = `In which country do you currently reside?`,
"Not They" = n) %>%
select(-Country1)) %>%
filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Country","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Country-wise > 100 respondents") %>%
hc_add_theme(hc_theme_538())
they %>%
dplyr::select(contains("your favorite media sources that report on data science topics")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:22, key = "questions", value = "media") %>%
#tidyr::replace_na() %>%
group_by(media) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("your favorite media sources that report on data science topics")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:22, key = "questions", value = "media") %>%
group_by(media) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-media1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
drop_na() %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("media","T_NT_Ratio")) %>%
hc_title(text = " Model Bias and Fairness - Media Sources - They vs Not They Ratio") %>%
hc_add_theme(hc_theme_538())
The above plot makes it very clear that podcasts like Partially Derivative, Data Skeptic and Linear Digressions are where Kagglers who strongly believe in ML Bias and Fairness get their news.
Nate Silver’s FiveThirtyEight makes it into the top 5 media sources ranked by the T_NT_Ratio KPI.
Kaggle Forums seem to need more work in encouraging or initiating discussions about ML Bias and Fairness.
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry1" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"Not They" = n) %>%
select(-Industry1)) %>%
#filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
filter(!Industry %in% c("3","Other")) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Industry","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Industry-wise") %>%
hc_add_theme(hc_theme_538())
Kagglers in industries like Non-Profit/Service and Government/Public Service have a better perception of the importance of Model Fairness and Bias.
It’s also unhealthy to see sectors like Military and Internet-based Services falling behind, as those are places where model evaluation is crucial and can have serious consequences.
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`Do you consider yourself to be a data scientist?`) %>% count() %>% ungroup() %>%
rename("DS" = `Do you consider yourself to be a data scientist?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`Do you consider yourself to be a data scientist?`) %>% count() %>% ungroup() %>%
rename("DS1" = `Do you consider yourself to be a data scientist?`,
"Not They" = n)# %>%
# select(-DS1)
) %>%
#filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
filter(!DS %in% c("3","Other")) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("DS","T_NT_Ratio")) %>%
hc_title(text = "They vs Not They Ratio - Based on if they consider themsevles a Data Scientist") %>%
hc_add_theme(hc_theme_538())
they %>%
dplyr::select(contains("types of data")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:11, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("types of data")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:11, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-DataType1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("DataType","T_NT_Ratio")) %>%
hc_title(text = "Type of Data - They vs Not They Ratio ") %>%
hc_add_theme(hc_theme_538())
they %>%
dplyr::select(contains("online platforms have you begun or completed data science courses")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:13, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("online platforms have you begun or completed data science courses")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:13, key = "questions", value = "DataType") %>%
group_by(DataType) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-DataType1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("column",hcaes("DataType","T_NT_Ratio", color = "T_NT_Ratio")) %>%
hc_title(text = " MOOC - Online course platform - They vs Not They Ratio") %>%
hc_add_theme(hc_theme_538())
survey %>%
select(`Approximately what percent of your data projects involved exploring unfair bias in the dataset and/or algorithm?`) %>%
drop_na() %>%
group_by(`Approximately what percent of your data projects involved exploring unfair bias in the dataset and/or algorithm?`) %>%
rename("Percentage" = `Approximately what percent of your data projects involved exploring unfair bias in the dataset and/or algorithm?`) %>%
count() %>%
hchart("area",hcaes("Percentage","n")) %>%
hc_title(text = "What percent of your data projects involved exploring unfair bias ") %>%
hc_add_theme(hc_theme_538())
survey %>%
dplyr::select(contains("not your models were successful")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:5, key = "questions", value = "Metric") %>%
#tidyr::replace_na() %>%
group_by(Metric) %>%
count() %>%
drop_na() %>%
hchart("column",hcaes("Metric","n")) %>%
hc_title(text = " Metrics used in organizations to determine whether or not your models were successful") %>%
hc_add_theme(hc_theme_darkunica())
Model Accuracy metrics are most often used to determine whether or not a model is successful. Revenue / Business Goals got almost 2x more votes than metrics that consider unfair bias.
survey %>%
dplyr::select(contains("difficult about ensuring that your algorithms are fair and unbiased")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:6, key = "questions", value = "Difficulty") %>%
#tidyr::replace_na() %>%
group_by(Difficulty) %>%
count() %>%
drop_na() %>%
hchart("bar",hcaes("Difficulty","n")) %>%
hc_title(text = "Most difficult about ensuring that your algorithms are fair and unbiased?") %>%
hc_add_theme(hc_theme_darkunica())
They now refers to those who think Being able to explain ML model outputs and/or predictions is Very important in Machine Learning, and Not They to those who think otherwise, including Slightly important, Not at all important and the like.
they <- survey %>% filter(`How do you perceive the importance of the following topics? - Being able to explain ML model outputs and/or predictions` == "Very important")
not_they <- survey %>% filter(`How do you perceive the importance of the following topics? - Being able to explain ML model outputs and/or predictions` != "Very important")
they %>% filter(!`What is your gender? - Selected Choice` %in% c("Other")) %>%
group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree" = `What is your gender? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% filter(!`What is your gender? - Selected Choice` %in% c("Other")) %>%
group_by(`What is your gender? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree1" = `What is your gender? - Selected Choice`,
"Not They" = n) %>%
select(-Degree1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("column",hcaes("Degree","T_NT_Ratio")) %>%
hc_title(text = "Gender-wise preference - Interpretable Machine Learning ") %>%
hc_add_theme(hc_theme_538())
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age" = `What is your age (# years)?`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What is your age (# years)?`) %>% count() %>% ungroup() %>%
rename("Age1" = `What is your age (# years)?`,
"Not They" = n) %>%
select(-Age1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("area",hcaes("Age","T_NT_Ratio")) %>%
hc_title(text = "Age Range - Interpretable Machine Learning ") %>%
hc_add_theme(hc_theme_538())
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang" = `What specific programming language do you use most often? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`What specific programming language do you use most often? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Lang1" = `What specific programming language do you use most often? - Selected Choice`,
"Not They" = n) %>%
select(-Lang1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("bar",hcaes("Lang","T_NT_Ratio")) %>%
hc_title(text = "Language - Interpretable Machine Learning ") %>%
hc_add_theme(hc_theme_538())
Unexpectedly, SAS/STATA tops the They - Not They ratio index, followed by R and MATLAB, making users of these three languages the Kagglers who most believe Model Interpretability is very important.
Python users need to be made more aware of IML, as Python still trails behind SQL and Julia on this index.
they %>% filter(!`Which best describes your undergraduate major? - Selected Choice` %in% c("Other")) %>%
group_by(`Which best describes your undergraduate major? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree" = `Which best describes your undergraduate major? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% filter(!`Which best describes your undergraduate major? - Selected Choice` %in% c("Other")) %>%
group_by(`Which best describes your undergraduate major? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Degree1" = `Which best describes your undergraduate major? - Selected Choice`,
"Not They" = n) %>%
select(-Degree1)) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("column",hcaes("Degree","T_NT_Ratio")) %>%
hc_title(text = "Degree (Education) - Interpretable Machine Learning ") %>%
hc_add_theme(hc_theme_538())
they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"They" = n) %>%
bind_cols(
not_they %>% #filter(`In which country do you currently reside?` %in% c("India","United States of America")) %>%
group_by(`In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`) %>% count() %>% ungroup() %>%
rename("Industry1" = `In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice`,
"Not They" = n) %>%
select(-Industry1)) %>%
#filter((They + `Not They`) > 100) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
filter(!Industry %in% c("3","Other")) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("Industry","T_NT_Ratio")) %>%
hc_title(text = "IML - They vs Not They Ratio - Industry-wise") %>%
hc_add_theme(hc_theme_538())
they %>%
dplyr::select(contains("online platforms have you begun or completed data science courses")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:13, key = "questions", value = "DataType") %>%
#tidyr::replace_na() %>%
group_by(DataType) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("online platforms have you begun or completed data science courses")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:13, key = "questions", value = "DataType") %>%
group_by(DataType) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-DataType1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("area",hcaes("DataType","T_NT_Ratio")) %>%
hc_title(text = " IML - MOOC - Online course platform - They vs Not They Ratio") %>%
hc_add_theme(hc_theme_538())
they %>%
dplyr::select(contains("your favorite media sources that report on data science topics")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:22, key = "questions", value = "media") %>%
#tidyr::replace_na() %>%
group_by(media) %>%
count() %>%
rename("They" = n) %>%
bind_cols(
not_they %>%
dplyr::select(contains("your favorite media sources that report on data science topics")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:22, key = "questions", value = "media") %>%
group_by(media) %>%
count() %>%
rename("Not They" = n)
) %>%
dplyr::select(-media1) %>%
mutate("T_NT_Ratio" = round(They/`Not They`,3)) %>%
drop_na() %>%
arrange(desc(T_NT_Ratio)) %>%
hchart("line",hcaes("media","T_NT_Ratio")) %>%
hc_title(text = " IML - Media Sources - They vs Not They Ratio") %>%
hc_add_theme(hc_theme_538())

Sources with a high T_NT_Ratio stand out as the media sources that They get their news from the most.

survey %>%
select(`Approximately what percent of your data projects involve exploring model insights?`) %>%
drop_na() %>%
group_by(`Approximately what percent of your data projects involve exploring model insights?`) %>%
rename("Percentage" = `Approximately what percent of your data projects involve exploring model insights?`) %>%
count() %>%
hchart("area",hcaes("Percentage","n")) %>%
hc_title(text = "What percent of your data projects involve exploring model insights?") %>%
hc_add_theme(hc_theme_538())

survey %>%
select(`Do you consider ML models to be "black boxes" with outputs that are difficult or impossible to explain?`) %>%
drop_na() %>%
group_by(`Do you consider ML models to be "black boxes" with outputs that are difficult or impossible to explain?`) %>%
rename("Perspective" = `Do you consider ML models to be "black boxes" with outputs that are difficult or impossible to explain?`) %>%
count() %>%
hchart("bar",hcaes("Perspective","n")) %>%
hc_title(text = 'ML models to be \"black boxes\" with outputs that are difficult or impossible to explain?') %>%
hc_add_theme(hc_theme_538())

The above plot tells us that most respondents are confident they can understand and explain the outputs of many, but not all, ML models. In fact, more respondents feel that most ML models are black boxes than have no opinion on the matter.
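One model-agnostic way to peek inside such a "black box" is permutation importance: shuffle one feature at a time and see how much the model's error grows. Below is a minimal, hypothetical sketch in base R, assuming a linear model on the built-in `mtcars` data and RMSE as the error metric (neither comes from the survey; they are purely for illustration).

```r
# Hypothetical sketch of permutation importance on a toy model.
set.seed(42)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
baseline <- rmse(mtcars$mpg, predict(fit, mtcars))

# Shuffle one predictor at a time: the bigger the error increase,
# the more the model leans on that feature.
importance <- sapply(c("wt", "hp", "qsec"), function(feature) {
  shuffled <- mtcars
  shuffled[[feature]] <- sample(shuffled[[feature]])
  rmse(mtcars$mpg, predict(fit, shuffled)) - baseline
})

sort(importance, decreasing = TRUE)
```

Because it only needs predictions, the same recipe works for any fitted model, which is why it is a popular first step when a model otherwise looks opaque.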
survey %>%
dplyr::select(contains("circumstances would you explore model insights and interpret")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:6, key = "questions", value = "circumstances") %>%
#tidyr::replace_na() %>%
group_by(circumstances) %>%
count() %>%
drop_na() %>%
hchart("bar",hcaes("circumstances","n")) %>%
hc_title(text = "Circumstances where exploring model insights and interpreting model happens") %>%
hc_add_theme(hc_theme_darkunica())

survey %>%
dplyr::select(contains("interpreting decisions that are made by ML models")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:15, key = "questions", value = "methods") %>%
#tidyr::replace_na() %>%
group_by(methods) %>%
count() %>%
drop_na() %>%
hchart("bar",hcaes("methods","n")) %>%
hc_title(text = " Preferred explaining and/or interpreting decisions that are made by ML models") %>%
hc_add_theme(hc_theme_darkunica())

survey %>%
select(`What methods do you prefer for explaining and/or interpreting decisions that are made by ML models? (Select all that apply) - Other - Text`) %>%
drop_na() %>%
rename("text" = `What methods do you prefer for explaining and/or interpreting decisions that are made by ML models? (Select all that apply) - Other - Text`) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word) %>%
arrange(desc(n)) %>%
with(wordcloud(word, n, max.words = 100, colors = pal))
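The wordcloud pipeline above boils down to three steps: tokenize the free-text answers, drop stop words, and count word frequency. A small base-R sketch of the same idea, for readers without `tidytext` (the `answers` vector here is made up for illustration, not taken from the survey):

```r
# Toy free-text answers, standing in for the survey's "Other - Text" column.
answers <- c("SHAP values and partial dependence plots",
             "partial dependence plots",
             "attention weights and SHAP values")
stopwords <- c("and", "the", "of")  # tiny stand-in for tidytext's stop_words

# Tokenize on whitespace, lowercase, then drop stop words.
words <- unlist(strsplit(tolower(answers), "\\s+"))
words <- words[!words %in% stopwords]

# Word frequencies, most common first -- the input a wordcloud needs.
sort(table(words), decreasing = TRUE)
```

`unnest_tokens()` and `anti_join(stop_words)` do the same thing more robustly (handling punctuation, multi-word tokens, and a real stop-word lexicon).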
survey %>%
dplyr::select(contains("tools and methods do you use to make your work easy to reproduce")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:11, key = "questions", value = "methods") %>%
#tidyr::replace_na() %>%
group_by(methods) %>%
count() %>%
drop_na() %>%
hchart("bar",hcaes("methods","n")) %>%
hc_title(text = "Tools and methods do you use to make your work easy to reproduce") %>%
hc_add_theme(hc_theme_darkunica())

survey %>%
dplyr::select(contains(" barriers prevent you from making your work even easier to reuse and reproduce")) %>%
dplyr::select(-contains("Text")) %>%
gather(1:8, key = "questions", value = "barriers") %>%
#tidyr::replace_na() %>%
group_by(barriers) %>%
count() %>%
drop_na() %>%
hchart("bar",hcaes("barriers","n")) %>%
hc_title(text = "Barriers preventing from making easier to reuse and reproducible work") %>%
hc_add_theme(hc_theme_darkunica())

It is worth looking into ML unfair-bias evaluation metrics too.
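To make that concrete, here is a hedged sketch of one common unfair-bias evaluation metric: demographic parity, the ratio of positive-outcome rates between groups (a ratio below ~0.8 is the widely cited "80% rule" red flag). The data below is synthetic, purely for illustration.

```r
# Synthetic selection outcomes for two groups (not survey data).
outcomes <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  selected = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Selection rate per group, then the ratio of the worst-off group
# to the best-off group.
rates <- tapply(outcomes$selected, outcomes$group, mean)
parity_ratio <- min(rates) / max(rates)

rates        # A = 0.75, B = 0.25
parity_ratio # 1/3 -- well below the 0.8 "80% rule" threshold
```

Running a check like this on a model's predictions, split by a sensitive attribute, is a simple first test for the kind of Machine Bias this analysis set out to highlight.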